Goto

Collaborating Authors

 Object-Oriented Architecture


Interaction Centric Knowledge Infusion and Transfer for Open Vocabulary Scene Graph Generation

Neural Information Processing Systems

Open-vocabulary scene graph generation (OVSGG) extends traditional SGG by recognizing novel objects and relationships beyond predefined categories, leveraging the knowledge from pre-trained large-scale models. Existing OVSGG methods always adopt a two-stage pipeline: 1) Infusing knowledge into large-scale models via pre-training on large datasets; 2) Transferring knowledge from pre-trained models with fully annotated scene graphs during supervised fine-tuning. However, due to a lack of explicit interaction modeling, these methods struggle to distinguish between interacting and non-interacting instances of the same object category. This limitation induces critical issues in both stages of OVSGG: it generates noisy pseudo-supervision from mismatched objects during knowledge infusion, and causes ambiguous query matching during knowledge transfer. To this end, in this paper, we propose an interACtion-Centric end-to-end OVSGG framework (ACC) in an interaction-driven paradigm to minimize these mismatches. For interactioncentric knowledge infusion, ACC employs a bidirectional interaction prompt for robust pseudo-supervision generation to enhance the model's interaction knowledge. For interaction-centric knowledge transfer, ACC first adopts interaction-guided query selection that prioritizes pairing interacting objects to reduce interference from non-interacting ones. Then, it integrates interaction-consistent knowledge distillation to bolster robustness by pushing relational foreground away from the background while retaining general knowledge. Extensive experimental results on three benchmarks show that ACC achieves state-of-the-art performance, demonstrating the potential of interaction-centric paradigms for real-world applications.


LiteReality: Graphics-Ready 3DScene Reconstruction from RGB-DScans

Neural Information Processing Systems

We propose LiteReality, a novel pipeline that converts RGB-D scans of indoor environments into compact, realistic, and interactive 3D virtual replicas. LiteReality not only reconstructs scenes that visually resemble reality but also supports key features essential for graphics pipelines--such as object individuality, articulation, high-quality physically based rendering materials. At its core, LiteReality first performs scene understanding and parses the results into a coherent 3D layout and objects, with the help of a structured scene graph.


2 (a) Visual Domain View (RGB) (b) Spectral Domain View (MSI)

Neural Information Processing Systems

Drone-based multi-object tracking is essential yet highly challenging due to small targets, severe occlusions, and cluttered backgrounds. Existing RGB-based multiobject tracking algorithms heavily depend on spatial appearance cues such as color and texture, which often degrade in aerial views, compromising tracking reliability. Multispectral imagery, capturing pixel-level spectral reflectance, provides crucial spectral cues that significantly enhance object discriminability under degraded spatial conditions. However, the lack of dedicated multispectral UAV datasets has hindered progress in this domain. To bridge this gap, we introduce MMOT, the first challenging benchmark for drone-based multispectral multi-object tracking dataset. It features three key characteristics: (i) Large Scale -- 125 video sequences with over 488.8K annotations across eight object categories; (ii) Comprehensive Challenges -- covering diverse real-world challenges such as extreme small targets, high-density scenarios, severe occlusions, and complex platform motion; and (iii) Precise Oriented Annotations -- enabling accurate localization and reduced object ambiguity under aerial perspectives.


VL-SAM-V2: Open-World Object Detection with General and Specific Query Fusion

Neural Information Processing Systems

Current perception models have achieved remarkable success by leveraging largescale labeled datasets, but still face challenges in open-world environments with novel objects. To address this limitation, researchers introduce open-set perception models to detect or segment arbitrary test-time user-input categories. However, open-set models rely on human involvement to provide predefined object categories as input during inference. More recently, researchers have framed a more realistic and challenging task known as open-ended perception that aims to discover unseen objects without requiring any category-level input from humans at inference time. Nevertheless, open-ended models suffer from low performance compared to openset models.


Generalizable Hand-Object Modeling from Monocular RGBImages via 3DGaussians

Neural Information Processing Systems

Recent advances in hand-object interaction modeling have employed implicit representations, such as Signed Distance Functions (SDF) and Neural Radiance Fields (NeRF) to reconstruct hands and objects with arbitrary topology and photo-realistic detail. However, these methods often rely on dense 3D surface annotations, or are tailored to short clips constrained in motion trajectories and scene contexts, limiting their generalization to diverse environments and movement patterns.


couch 150 200 50 0 d

Neural Information Processing Systems

To encode structure, FactoredScenes learns are dra a wn, library then of uses functions large language capturing models reusable to generate layout patterns high-lev from el programs, which scenes regularized a program-conditioned by the learned library model . T to o represent hierarchically scene predict variations, object FactoredScenes poses, and retrie learns ves and real-w places orld 3D rooms objects that in are a dif scene.


Video Frames Dynamic Content (Moving Object & Camera) Annotations Object Mask Object & Category Caption Scene Camera Caption

Neural Information Processing Systems

Understanding structure, real-w the orld dynamic motion, ph and ysical semantic world, content characterized with textual by its descriptions, evolving 3D is crucial for human-agent interaction and enables embodied agents to perceive and act datasets within are real often en deri vironments ved from with limited human simulators -like capabilities.



Flow: An Efficient Multi-frame Scene Flow Estimation Method

Neural Information Processing Systems

While recent trends shift towards multi-frame reasoning, they suffer from rapidly escalating computational costs as the number of frames grows. To leverage temporal information more efficiently, we propose DeltaFlow ( Flow), a lightweight 3D framework that captures motion cues via a scheme, extracting temporal features with minimal computational cost, regardless of the number of frames. Additionally, scene flow estimation faces challenges such as imbalanced object class distributions and motion inconsistency. To tackle these issues, we introduce a Category-Balanced Loss to enhance learning across underrepresented classes and an Instance Consistency Loss to enforce coherent object motion, improving flow accuracy. Extensive evaluations on the Argoverse 2, Waymo and nuScenes datasets show that Flow achieves state-of-the-art performance with up to 22% lower error and 2 faster inference compared to the next-best multi-frame supervised method, while also demonstrating a strong cross-domain generalization ability.


Object-Centric Concept-Bottlenecks

Neural Information Processing Systems

Developing high-performing, yet interpretable models remains a critical challenge in modern AI. Concept-based models (CBMs) attempt to address this by extracting human-understandable concepts from a global encoding (e.g., image encoding) and then applying a linear classifier on the resulting concept activations, enabling transparent decision-making. However, their reliance on holistic image encodings limits their expressiveness in object-centric real-world settings and thus hinders their ability to solve complex vision tasks beyond single-label classification. To tackle these challenges, we introduce Object-Centric Concept Bottlenecks (OCB), a framework that combines the strengths of CBMs and pre-trained object-centric foundation models, boosting performance and interpretability. We evaluate OCB on complex image datasets and conduct a comprehensive ablation study to analyze key components of the framework, such as strategies for aggregating object-concept encodings. The results show that OCB outperforms traditional CBMs and allows one to make interpretable decisions for complex visual tasks.